Author: Hang He

GitHub Page: https://GavinHHE.github.io.

Class: CMPS 3160

Dataset Source:

From CDC: PLACES: Local Data for Better Health, Census Tract Data 2020 release: https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-Census-Tract-D/cwsq-ngmh

From Kaggle: Cardio Vascular Disease Detection: https://www.kaggle.com/bhadaneeraj/cardio-vascular-disease-detection

Diabetes Health Indicators Dataset: https://www.kaggle.com/alexteboul/diabetes-health-indicators-dataset

Code Reference: https://plotly.com/python/choropleth-maps/

Project Plan & Project goals:

  1. Introduction

  2. ETL and EDA

  3. Model Construction and Evaluation

  4. Conclusion

Introduction:

Background and Project Goals

According to an article on heart.org, nearly half of American adults have high blood pressure. High blood pressure (HBP, or hypertension) usually has no obvious symptoms to indicate that something is wrong. It develops slowly over time and can be related to many causes. For this project, I will explore health data from the CDC and cardiovascular disease data from Kaggle through visualization and analysis. The final report will include a visualization of the percentage of the population with HBP by state and an analysis of the importance of HBP as a risk factor for cardiovascular disease and diabetes. I will also build machine learning models to make predictions from the available data. Ideally, the models will be able to predict accurately whether a specific person has cardiovascular disease or diabetes.

About the dataset

All data can also be found in the repository.

Census Tract Data 2020 release (2017 to 2018) contains the aggregated responses to surveys conducted by multiple organizations. Columns in the dataset include when and where the survey was conducted, the total population involved, a description of the question asked, and the response value as a percentage. I will use this dataset to visualize the HBP rate and perform some basic calculations.

Cardio Vascular Disease Detection contains data on people both with and without cardiovascular disease. It includes personal information such as age, gender, height, weight, and blood pressure measurements. The dataset also has columns indicating smoking, drinking, and exercise status. I will assess those three risk factors in the machine learning part.

Diabetes Health Indicators Dataset contains data on people both with and without diabetes. Columns include whether the person smokes or drinks, age group, education level, income level, gender, etc. There are no missing values in the dataset. Many of the columns are categorical or boolean variables.

None of the three datasets has missing values. Data from Census Tract Data 2020 release are clean and do not need further cleaning. Cardio Vascular Disease Detection contains many extreme values that are unrealistic. I will deal with those values by dropping or transforming them. Outliers in Census Tract Data are not removed in the EDA part, but I will transform or drop outliers when making predictions. Most outliers in Diabetes Health Indicators Dataset are removed in the EDA part.

The main question for my project is "How risky is HBP? Will it cause people to have cardiovascular disease or diabetes?" In addition, I would also like to analyze other important risk factors related to these diseases.

ETL and EDA:

Here I will load the Census Tract Data 2020 release. The columns are described on the website: https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-Census-Tract-D/cwsq-ngmh. Since I am only interested in the survey results about blood pressure, I will slice the data and keep the columns: ['Year','StateAbbr','StateDesc','CountyName','Measure','Data_Value_Unit','Data_Value','Geolocation']
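
The column selection described above can be sketched as follows. This is a minimal sketch rather than the notebook's original code: the helper name `select_bp_rows`, the file name in the comment, and the substring filter on `Measure` are assumptions, and the small inline frame with made-up values only stands in for the real CDC file.

```python
import pandas as pd

# Columns named in the text above.
COLUMNS = ["Year", "StateAbbr", "StateDesc", "CountyName",
           "Measure", "Data_Value_Unit", "Data_Value", "Geolocation"]


def select_bp_rows(places: pd.DataFrame) -> pd.DataFrame:
    """Keep the columns of interest and the rows whose measure mentions blood pressure."""
    subset = places[COLUMNS]
    # Assumption: blood pressure measures can be found by a case-insensitive substring match.
    mask = subset["Measure"].str.contains("blood pressure", case=False, na=False)
    return subset[mask].reset_index(drop=True)


# Stand-in rows with made-up values; with the real file one would start from
# something like pd.read_csv("places_census_tract_2020.csv") (hypothetical file name).
demo = pd.DataFrame({
    "Year": [2017, 2018],
    "StateAbbr": ["LA", "LA"],
    "StateDesc": ["Louisiana", "Louisiana"],
    "CountyName": ["Orleans", "Orleans"],
    "Measure": ["High blood pressure among adults aged >=18 years",
                "Current asthma among adults aged >=18 years"],
    "Data_Value_Unit": ["%", "%"],
    "Data_Value": [35.2, 9.8],
    "Geolocation": ["POINT (-90.07 29.95)", "POINT (-90.07 29.95)"],
    "TotalPopulation": [4000, 4000],
})
bp = select_bp_rows(demo)
```

Here `bp` keeps only the eight listed columns and the single blood pressure row of the stand-in data.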

Load Cardio Vascular Disease Detection data

Features:

Age | Objective Feature | age | int (days) |

Height | Objective Feature | height | int (cm) |

Weight | Objective Feature | weight | float (kg) |

Gender | Objective Feature | gender | categorical code |

Systolic blood pressure | Examination Feature | ap_hi | int |

Diastolic blood pressure | Examination Feature | ap_lo | int |

Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |

Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |

Smoking | Subjective Feature | smoke | binary |

Alcohol intake | Subjective Feature | alco | binary |

Physical activity | Subjective Feature | active | binary |

Presence or absence of cardiovascular disease | Target Variable | cardio | binary |

Remove extreme values

There is no explanation of the meaning of 1 and 2 in the gender column. After comparing the mean weight and height of the two groups, I concluded that gender value 2 is male and gender value 1 is female.
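
The comparison behind that conclusion might look like the sketch below. The rows are made up and only mimic the column layout of the cardio dataset, and treating the taller group as male is a heuristic, not something the dataset documents.

```python
import pandas as pd

# Made-up rows that only mimic the column layout of the cardio dataset.
demo = pd.DataFrame({
    "gender": [1, 1, 2, 2],
    "height": [160, 162, 172, 176],
    "weight": [65.0, 70.0, 78.0, 82.0],
})

# Average height and weight for each gender code.
means = demo.groupby("gender")[["height", "weight"]].mean()

# Heuristic (assumption): the code with the larger average height is taken to be male.
likely_male_code = means["height"].idxmax()
```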

According to the documentation, age is stored in days.
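
If ages in years are easier to read, a conversion could be sketched as below. The column name `age_years` and the 365.25 days per year divisor are assumptions for illustration; the text above does not say whether or how the ages were converted.

```python
import pandas as pd

DAYS_PER_YEAR = 365.25  # assumption: average year length including leap years

# Made-up ages in days.
demo = pd.DataFrame({"age": [18250, 21915]})

# Hypothetical helper column holding the age in years, rounded to one decimal.
demo["age_years"] = (demo["age"] / DAYS_PER_YEAR).round(1)
```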

The box plots of height, weight, ap_hi, and ap_lo show many extreme values, some of which are unrealistic. I will remove the unrealistic values. The remaining outliers could be meaningful here, so I will transform them later.

According to https://en.wikipedia.org/wiki/List_of_the_verified_shortest_people, the shortest recorded height is 54.6 cm. I will remove rows that have a height lower than 55.

I observed many outliers in both ap_hi and ap_lo. According to https://pubmed.ncbi.nlm.nih.gov/7741618/, the highest blood pressure recorded is 370/360. I will remove rows that have ap_hi or ap_lo higher than 360.

It is unrealistic for a living person to have a diastolic pressure equal to or greater than the systolic pressure.

It is unrealistic for a living person to have diastolic and systolic pressures less than 50.
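
The plausibility rules listed above can be combined into a single filter. The sketch below is one possible reading, not the notebook's original code: the function name `drop_unrealistic` is hypothetical, the rows are made up, and it assumes a row is dropped when either pressure is below 50.

```python
import pandas as pd


def drop_unrealistic(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows that violate the plausibility rules for height and blood pressure."""
    keep = (
        (df["height"] >= 55)            # shortest verified height is 54.6 cm
        & (df["ap_hi"] <= 360)          # pressures above 360 are treated as errors
        & (df["ap_lo"] <= 360)
        & (df["ap_lo"] < df["ap_hi"])   # diastolic must be below systolic
        # Assumption: a row is dropped if either pressure is below 50.
        & (df["ap_hi"] >= 50)
        & (df["ap_lo"] >= 50)
    )
    return df[keep].reset_index(drop=True)


# Made-up rows: only the first and the last one satisfy every rule.
demo = pd.DataFrame({
    "height": [168, 50, 170, 165, 172, 160],
    "ap_hi":  [120, 120, 14000, 80, 40, 130],
    "ap_lo":  [80, 80, 90, 90, 30, 85],
})
clean = drop_unrealistic(demo)
```

Writing each rule as its own condition keeps the filter easy to audit, even though some conditions overlap once diastolic is required to be below systolic.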

According to the documentation, gender, cholesterol, gluc, smoke, alco, active, and cardio are categorical variables.

Values of 1, 2, and 3 are hard to interpret for the columns cholesterol and gluc. I mapped both columns according to the data specification provided.
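
A minimal sketch of the mapping and type conversion, using the labels from the feature table. The rows are made up, and the exact label strings are an assumption about how the mapped values were spelled.

```python
import pandas as pd

# Labels taken from the feature table: 1 normal, 2 above normal, 3 well above normal.
LEVELS = {1: "normal", 2: "above normal", 3: "well above normal"}
CATEGORICAL = ["gender", "cholesterol", "gluc", "smoke", "alco", "active", "cardio"]

# Made-up rows that only mimic the column layout of the cardio dataset.
demo = pd.DataFrame({
    "gender": [1, 2, 2], "cholesterol": [1, 3, 2], "gluc": [1, 1, 3],
    "smoke": [0, 1, 0], "alco": [0, 0, 1], "active": [1, 1, 0], "cardio": [0, 1, 1],
})

# Replace the numeric codes with readable labels.
for col in ("cholesterol", "gluc"):
    demo[col] = demo[col].map(LEVELS)

# Store all coded columns with the pandas category dtype.
demo = demo.astype({col: "category" for col in CATEGORICAL})
```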

Looking at the distributions of those 5 columns, weight and height seem to follow a normal distribution. There are some outliers in weight and height, but the number of outliers is too small to be shown on the graph. Age is positively skewed. It is hard to tell the distribution of ap_lo and ap_hi.

Here, I compared the mean values of those 5 columns. The graphs show that people with cardiovascular disease are slightly older and have higher blood pressure measurements.
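
The group comparison can be reproduced numerically with a grouped mean. This is a sketch on made-up rows, so the numbers illustrate the mechanics only and say nothing about the real dataset.

```python
import pandas as pd

NUMERIC = ["age", "height", "weight", "ap_hi", "ap_lo"]

# Made-up rows that only mimic the column layout of the cardio dataset.
demo = pd.DataFrame({
    "age":    [18000, 19000, 21000, 22000],
    "height": [165, 170, 168, 166],
    "weight": [70.0, 72.0, 80.0, 78.0],
    "ap_hi":  [115, 120, 140, 150],
    "ap_lo":  [75, 80, 90, 95],
    "cardio": [0, 0, 1, 1],
})

# Mean of each numeric column for people without (0) and with (1) the disease.
group_means = demo.groupby("cardio")[NUMERIC].mean()

# Positive entries mean the group with the disease has the larger average.
difference = group_means.loc[1] - group_means.loc[0]
```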

I visualized the number of smokers and drinkers among people with cardiovascular disease. From the graph, smoking and drinking appear to have no significant impact on cardiovascular disease.

I also created a heatmap to show the correlation between variables. The disease indicator, cardio, has a high correlation with ap_hi and ap_lo. Age and weight are also correlated with cardio.
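
The numbers behind such a heatmap come from a correlation matrix. The sketch below computes it on made-up rows and ranks the features by their correlation with cardio; drawing the heatmap itself is left as a comment because the plotting library used in the notebook is not stated here.

```python
import pandas as pd

# Made-up rows that only mimic the column layout of the cardio dataset.
demo = pd.DataFrame({
    "ap_hi":  [110, 120, 140, 150, 115, 145],
    "ap_lo":  [70, 80, 90, 95, 75, 85],
    "height": [170, 160, 165, 168, 172, 161],
    "cardio": [0, 0, 1, 1, 0, 1],
})

# Pairwise Pearson correlations between all columns.
corr = demo.corr()

# Correlation of every feature with the target, strongest positive first.
cardio_corr = corr["cardio"].drop("cardio").sort_values(ascending=False)

# Assumption: with seaborn installed, sns.heatmap(corr, annot=True) would draw the matrix.
```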

Convert the data types of the Diabetes Health Indicators Dataset according to the column descriptions provided by the data uploader

Since most of the columns are either categorical or boolean variables, I will check the distribution of BMI only

People rarely have a BMI greater than 50, so I will drop rows with a BMI higher than 50.
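
The BMI cutoff amounts to a one-line filter. A minimal sketch on made-up rows, keeping a BMI of exactly 50 because only values higher than 50 are dropped:

```python
import pandas as pd

BMI_CUTOFF = 50  # threshold stated in the text above

# Made-up rows that only mimic the column layout of the diabetes dataset.
demo = pd.DataFrame({
    "BMI": [22.0, 31.5, 50.0, 64.0],
    "Diabetes_binary": [0, 1, 1, 0],
})

# Keep rows whose BMI does not exceed the cutoff.
filtered = demo[demo["BMI"] <= BMI_CUTOFF].reset_index(drop=True)
```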

There are only a few outliers, so their potential effect should be small.

BMI seems to follow a normal distribution. From another graph, I observed that people with diabetes tend to have a higher BMI.

Among people who smoke, there is a higher chance of having already been diagnosed with diabetes. Among people who have high blood pressure, the chance of having already been diagnosed with diabetes is significantly higher.
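
These statements compare the share of diagnosed people within each group, which a grouped mean of the binary target gives directly. The sketch below uses made-up rows; the helper name `diabetes_rate_by` is hypothetical, and the column name `Smoker` is assumed from the Kaggle dataset.

```python
import pandas as pd

# Made-up rows that only mimic the column layout of the diabetes dataset.
demo = pd.DataFrame({
    "HighBP":          [1, 1, 1, 1, 0, 0, 0, 0],
    "Smoker":          [1, 0, 1, 0, 1, 0, 1, 0],
    "Diabetes_binary": [1, 1, 1, 0, 0, 0, 1, 0],
})


def diabetes_rate_by(df: pd.DataFrame, factor: str) -> pd.Series:
    """Share of people with diabetes within each level of a binary risk factor."""
    # The mean of a 0/1 column is the proportion of ones.
    return df.groupby(factor)["Diabetes_binary"].mean()


rate_by_bp = diabetes_rate_by(demo, "HighBP")
rate_by_smoker = diabetes_rate_by(demo, "Smoker")
```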

I observed that the correlations between Diabetes_binary and HighBP and between Diabetes_binary and BMI are relatively high. There are also high correlations between PhysHlth and DiffWalk and between MentHlth and PhysHlth. I will remove PhysHlth when applying the classification model.
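
One way to spot such redundant features is to list the column pairs whose absolute correlation exceeds a threshold and then drop one column from the strongly related group. In the sketch below the rows are made up, and both the helper name `high_corr_pairs` and the 0.7 threshold are assumptions for illustration.

```python
import pandas as pd


def high_corr_pairs(df: pd.DataFrame, threshold: float) -> list:
    """Return column pairs whose absolute Pearson correlation is at least `threshold`."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, first in enumerate(cols):
        for second in cols[i + 1:]:  # visit each unordered pair once
            if corr.loc[first, second] >= threshold:
                pairs.append((first, second))
    return pairs


# Made-up rows that only mimic the column layout of the diabetes dataset.
demo = pd.DataFrame({
    "PhysHlth": [0, 5, 10, 20, 30, 2],
    "DiffWalk": [0, 0, 1, 1, 1, 0],
    "MentHlth": [1, 4, 12, 18, 28, 0],
    "BMI":      [30, 22, 27, 24, 29, 25],
})

pairs = high_corr_pairs(demo, threshold=0.7)  # 0.7 is an illustrative cutoff

# Drop PhysHlth, which is strongly related to both DiffWalk and MentHlth.
features = demo.drop(columns=["PhysHlth"])
```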